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We introduce a new method for performing clustering with the 
aim of fitting clusters with different scatters and weights. It is de- 
signed by allowing to handle a proportion a of contaminating data 
to guarantee the robustness of the method. As a characteristic fea- 
ture, restrictions on the ratio between the maximum and the mini- 
mum eigenvalues of the groups scatter matrices are introduced. This 
makes the problem to be well defined and guarantees the consistency 
of the sample solutions to the population ones. 

The method covers a wide range of clustering approaches depend- 
ing on the strength of the chosen restrictions. Our proposal includes 
an algorithm for approximately solving the sample problem. 

1. Introduction. Many statistical practitioners view cluster analysis as a 
collection of mostly heuristic techniques for partitioning multivariate data. 
This arises from the fact that most cluster techniques are not explicitly based 
on a probabilistic model, and could lead to the feeling that no assumption is 
necessary and that the obtained results are "objective" (see the comments 
on page 123 in Flury [7]). However, objectiveness is far from reality and 
cluster results are most of the time strongly affected by the chosen method 
and its performance is very dependent on the underlying probabilistic model 
which the method implicitly assumes. 

For instance, when using fc-means, we must keep in mind that this method 
is designed for clustering spherical groups of roughly equal sizes and, thus, 
it is not reliable for analyzing constellations of groups that depart strongly 
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Fig. 1. (a) Two groups with 10% of the observation discarded (trimmed points are the 
small circles), (b) Three groups partition with no observations discarded. 

from this assumption. So, in order to understand clustering methods and 
decide what is more appropriate in a particular case, it is interesting to 
construct feasible models and develop suitably tailored methods for them. 

Determining appropriate models for clustering is even more important 
when noisy data or outliers are present. Without specifying a model, what 
we understand by an observation following an "anomalous" behavior is not 
clear. For instance, it is difficult to decide when a set of very scattered ob- 
servations should be considered as an extra proper group or merely as a 
background noise to be discarded (see Figure 1). Additionally, it is not obvi- 
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ous if a small group of tightly joined outliers should be considered as a proper 
group instead of a contamination phenomenon. Finally, note that the pre- 
cise detection of the outliers is an important task due to the serious troubles 
they introduce in standard clustering procedures (see, e.g., Garcia-Escudero 
and Gordaliza [12] and Hennig [19]) as well as the appealing interest that 
outliers could have by themselves after explaining why they depart from 
general behavior. 

Two general model-based approaches which provide a theoretically well- 
based clustering criterion in presence of outliers are (see Bock [2]) the mix- 
ture modeling and the trimming approach. To the first category belongs, say, 
the work by Fraley and Raftery [8], that considers mixture fittings with the 
addition of a mixture component accounting for the "noise," or McLachlan 
and Peel [23] that resorts to mixtures of t distributions. In this paper we 
are concerned with the trimming approach, previously introduced in Cuesta- 
Albertos, Gordaliza and Matran [4] and followed by recent proposals by Gal- 
legos [9, 10] and Gallegos and Ritter [11] (see also Garci'a-Escudero, Gordal- 
iza and Matran [14] and [15]). Notice that a "crisp" 0-1 approach is usually 
adopted in trimming approaches while some groups' ownership probabilities 
are generally returned by mixture modeling. Also, while mixture modeling 
tries to fit the outlying observations in the model, the trimming approach 
attempts to discard them completely. The methodology presented in this 
paper falls within the category of trimming approach methods and all the 
comparisons will be made within this category. 

To know how to perform the trimming in cluster analysis is not straight- 
forward because there exist no privileged directions for searching outlying 
values and, most of the time, we even need to remove observations which fall 
between the groups ("bridge" data points). The first attempt of trimming 
in clustering, through an "impartial" approach, appeared in [4] as a mod- 
ification of the /c-means method. Moreover, [12] shows that the impartial 
trimming provides better results in terms of robustness than the considera- 
tion of different penalty functions in the /c-means method (e.g., fc-medoids). 

The use of trimmed /c-means involves a considerable drawback because 
it implicitly assumes the same spherical covariance matrix for the groups 
(as classical fc-means does). The extension in [11] through the trimmed de- 
terminant criterion allows for a general expression of a common covariance 
matrix. Moreover, [11] also introduces there a statistical clustering model 
with outliers called the spurious-outlier model extending the usual statisti- 
cal clustering setup (Mardia, Kent and Bibby [20]) to include the presence 
of a proportion a of noise. This point of view leads to the consideration of 
the clustering method via maximum likelihood that we pursue in this paper. 

Unfortunately, the heterogeneous robust clustering problem (where dif- 
ferent groups' covariance matrices are admitted) is notably harder. The 
proposed objective function is now unbounded and the different "scales" 



L. A. GARCIA-ESCUDERO ET AL. 

(a) (b) 




Fig. 2. An unrestricted solution for the same data set in Figure 1 appears in (a) when 
k — 3 and a — 0.1. Compare with a restricted solution, also when k = 3 and a = 0.1, in 
(b). 



complicates the global ordering of the observations around their closest 
centers through Mahalanobis distances (see Garcia-Escudero and Gordaliza 
[13]). This motivates that unrestricted algorithms often find small clusters 
of points either grouped or almost lying in a lower dimensional space [Fig- 
ure 2(a)]. Adding some kind of restriction could allow us to obtain more 
informative partitions [Figure 2(b)]. 

A possible way of adding restriction has been considered in Gallegos [9, 10] 
by normalizing the covariances to have unit determinant in the steps followed 
in the algorithms there. This idea works nicely when the groups have similar 
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scales, but it does not work so well when very different scales are involved. 
This normalization can be too restrictive and it seems more adequate to 
incorporate the restrictions directly in the problem statement instead of 
(artificially) appearing in the algorithm. 

To address these difficulties, we introduce in our proposal constraints 
on the covariance matrices eigenvalues-ratio. A constant c will control the 
strength of the restrictions allowing a wide range of clustering problems. 

Another difficulty arises under the presence of different sizes for the un- 
derlying groups. Our proposal also includes, in a successful way, the consid- 
eration of different groups' weights to handle this difficulty. 

Existence results for both, the sample and the population problem, as 
well as the consistency of the sample maximizers to the population ones 
under mild assumptions are shown in Section 2. The proofs are sketched in 
the Appendix stressing on the importance of the eigenvalue restrictions to 
achieve these results. 

In Section 3, we propose a feasible algorithm (TCLUST) for approxi- 
mately solving the sample version of the problem. It may be seen as a classi- 
fication EM-algorithm (Celeux and Govaert [3]) where a kind of "concentra- 
tion" step as in fast-MCD algorithm (Rousseeuw and van Driessen [25]) is 
also applied. The eigenvalues-ratio restrictions will be imposed by solving a 
restricted least squares problem. Dykstra's algorithm in [6] may be applied 
for addressing that problem. Finally, in Section 4, we include a simulation 
study showing the gain provided by the proposed method with respect to 
other trimming proposals. 



2. Robust clustering and eigenvalues-ratio restrictions. We will consider 
throughout the paper a data set {x±, . . . ,x n } in the Euclidean space W. By 
/(•;/i,E), we will denote the probability density function (p.d.f.) of the p- 
variate normal distribution with mean fi and covariance matrix E. 

Under the spurious-outlier model, introduced in [11], the likelihood func- 
tion is given by 



(2.1) 



n n fi^H^) 



j=l ieRj 



with R = UjLi Rj and #R 



n 



na 



and where the parameter k denotes 



the total number of groups, Rj contains the indexes of the "regular" obser- 
vations assigned to group j and the remaining observations are considered 
spurious and obtained from some g^s, p.d.f.s in W. 

If E = a 2 I is chosen in (2.1), then we would be performing the trimmed 
fe-means method. An algorithm in the spirit of the fast-MCD (both coincide 
when k = 1) is provided in [11] for approximately maximizing (2.1). 
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Our modification of the "spurious-outlier" model considers different scat- 
ter matrices £j 's and assumes the presence of some underlying group weights 



7Tj's, with J2j=l "J 
(2.2) 



1. This leads to the maximization of 

k 



n n ^jf(xu^,^j) 

j = li€Rj 



with R = Ujf=i Rj an£ i i^R = n — [na] . Additionally, restrictions on the eigen- 
values of the Sj's matrices will be later introduced. 

As in [11], we can avoid the nonregular contribution to the previous max- 
imization problem when the gi's satisfy the condition 

k 

(2.3) argmaxmaxj [ ffjffa; fij, C argmax 9%{xi), 



n 



! R 3 



where 7Z stands for the set of all partitions of the indexes {l,...,n} onto 
k groups of regular observations, R, and a group containing the nonregu- 
lar ones, with #R = n — [na]. Note that the right-hand side in condition 
(2.3) only involves the nonregular observations and does not depend on the 
partition of the regular ones. Therefore, it simply means that any set of 
nonregular observations in every optimal partition maximizing (2.2) could 
be also obtained as a subset of [na] elements of the sample maximizing the 
likelihood corresponding to the noise. This condition easily holds under rea- 
sonable assumptions for the ^'s whenever the nonregular observations may 
be seen as merely "noise." For instance, examples for g^s shown in [9, 10] 
and [11] can be trivially considered here. We refer the interested reader to 
these papers for details. 

A better statement of our problem is obtained by introducing assignment 
functions Zj, j = 0, 1, . . . , k. For every point x in M p (not only the sample 



observations Xj's are classified), let us define Zj 



1 whenever x is assigned 



to the class Rj, j = 1, . . . , k, or zq(x) = 1 if it is being trimmed off. Through 
these functions, assuming that the c/j's may be omitted, we can raise again 
the problem in (2.2) to the maximization of 

1 Lj=l 

1 functions defined in the whole sample space verifying 
1 and J2?=i z o( x i) = [net]. This statement of the problem, tak- 



n 



where Zj are 



3=0 



Zj {Xi 



ing logarithms, leads to the following general one. 
Robust clustering problem: Given a probability measure P, maximize 



(2.4) 



E P 



X^OXiogTTj + log/(-;/ij,Sj)) 
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in terms of the assignment functions 

k 

Zj : MP i — > {0, 1} such that zj = 1 and EpZq(-) = a, 

j=0 

and the parameters 9 = (tti, . . . ,7Tfc,/zi, . . . Si, . . . , E^) corresponding 
to weights ttj S [0, 1], with Y^j=i n j = 1) mean vectors fij € W and sym- 
metric positively definite p x p-matrices Ej, j = 1, . . . , k. 

If P n stands for the empirical measure, P n = l/nX)"=i by replacing 
P by P n , we recover the original sample problem [notice that, perhaps, 
Ep n zo{-) — a cannot be exactly achieved but this familiar fact will not be 
important in our reasonings]. 

Our restrictions on the eigenvalues of the covariance matrices may be seen 
as an extension of those introduced by Hathaway [18] for univariate data. 
They avoid the singularities introduced by the possibility of very different 
S/s. 

(ER) Eigenvalues-ratio restrictions: We fix a constant c > 1 such that 

M n /m n < c 

for 

M n = max max Aj(Ey) and m n = min min Aj(IL-), 
j=l,...,k 1=1, ...,p j=l,...,kl=l,...,p 

A/(Sj) being the eigenvalues of the matrices Ej, I = 1, ...,p and j = 
l,...,k. The set of #'s which obey this condition is denoted by C . 

Note that c = 1 produces the strongest possible restriction. In this case, 
the proposed method may be viewed as a trimmed A:-means method with 
weights. However, the main advantage of this approach relies on the fact 
that the parameter c allows us to achieve certain (controlled) freedom in 
how we want to handle the different scattering of the groups. 

Figures 1 and 2 show the results of the application of the proposed 
methodology (by using the TCLUST algorithm described in Section 3) to 
a data set made up of 3 bivariate Gaussian clusters where the most scat- 
tered one accounts for 10% of the data. The result when k = 2, a = 0.1 and 
c = 5 appears in Figure 1(a). The result there is not very dependent on c 
as long as the two (main) groups are not too different in their eigenvalues 
once the most scattered group was trimmed off. The values k = 3 and a = 
are considered in Figure 1(b), with a large value for c (c = 50) which allows 
for the presence of the more scattered group. The values k = 3 and a = 0.1 
were applied in Figure 2. A rather large c (unrestricted problem) was cho- 
sen in Figure 2(a) while a small c = 1 (restricted problem) was considered 
in Figure 2(b). 
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To exclude in the subsequent analysis those probability distributions ob- 
viously unappropriate for the introduced approach, we will assume on the 
underlying distribution P the following mild condition. (It trivially holds for 
absolutely continuous distribution or for empirical measures corresponding 
to a sample large enough from an absolutely continuous distribution.) 

(PR) The distribution P is not concentrated on k points after removing a 
probability mass equal to a. 

To conclude this section, we will notably simplify our problem through 
an adequate reformulation that leads to expressing the assignment functions 
£,-'s only in terms of 9. This will also be a keystone for deriving our algorithm 
to solve the sample counterpart of the problem. 

Given 9 € O c , we consider discriminant functions defined as 



Note that these are familiar functions in the application of Bayes' rules in 
discriminant analysis. These functions will also serve to provide an "outly- 
ingness" measure of the observations. 

Using previous definitions, for a given 9 and a probability measure P, we 
consider the distribution function of D{-;9) and its corresponding a-quantile: 



(2.6) G{u;9,P):=P{D(-;9)<u) and R(9, P) := inf{G(u; 9, P) > a}. 



With this notation, we have the following straightforward characterization 
for the assignment functions: 

Proposition 1. The robust clustering problem can be simplified, using 
the discriminant functions (2.5), to the maximization in 9 of 



Dj(x; 9) = irjf(x; fij, Ey) and 



(2.5) 



D{x; 9) = max{£>i(x; 9),..., D k {x; 9)}. 



u 



(2.7) 




where the assignment functions are obtained from 9 as 



Zj (x; 9) = I{x : {D(x; 9) = Dj(x; 9)} n {Dj(x; 9) > R{9, P)}} 



and 



k 



z (x;9) = 1 -^2zj(x;9). 



3=1 
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That is, we assign x to the class j with the largest discriminant function 
value Dj(x;9) or x is trimmed off when all the Dj(x;0)'s [and consequently, 
D{x;9)] are smaller than R(9,P). (In order to break ties in the discriminant 
function values, the lexicographical ordering could be applied.) 

The relevant mathematical results to be considered are given in the fol- 
lowing propositions. 

Proposition 2 (Existence). // (PR) holds for distribution P, then 
there exists some 6 € O c such that the maximum of (2.4) under (ER) is 
achieved. 

By examining the proof of previous result, we can see that although we 
have admitted weights ttj = 0, this is not a drawback when taking log7Tj 
because in this case Zj(-;6) = and then the set {x : Zj(x;9) = 1} is empty. 
The presence of groups with zero weight does actually happen in practice. 
For instance, when k = 2, c=l, a = and P is the iV(0, 1) distribution in 
the real line, we can see that 9 = (iri,ir2, fJ-i, fJ^,Oi,of) = (1, 0, 0,/^, 1, 1) is 
the optimal solution for every /12 £R and c > 1 . 

Proposition 3 (Consistency). Assume that P has a strictly positive 
density function and that 9q is the unique maximum of (2.4) under (ER). If 
9 n 6 O c denotes a sample version estimator based on the empirical measure 
P n , then 9 n — > #0 almost surely. 

Remark 1. Notice that a uniqueness condition is needed in order to 
establish the consistency result. Unfortunately, this property does not always 
hold. For instance, think of a symmetric mixture P in the real line with two 
well-separated modes, a high trimming level and k = 1. That uniqueness 
property was already needed for establishing the same consistency result 
for the trimmed fe-means and, even in this simpler case, the statement of 
general uniqueness results was difficult (see Remark 4.1 in [14]). However, 
as in the trimmed /c-means problem, we believe that it is quite rare to find 
distributions where the uniqueness fails when dealing with "reasonable" data 
for clustering and when parameters k and a have been properly chosen. 

3. The TCLUST algorithm. The empirical problem presented in Section 
2 has a very high computational complexity. An exact algorithm seems to be 
not feasible even for moderate sample sizes. Thus, the existence of adequate 
algorithms for approximately solving the sample problem is as important as 
the procedure itself. With this in mind, we propose the TCLUST algorithm 
(an R-code implementation is available at http://www.eio.uva.es/~langel/ 
software), an EM-principle based algorithm, intended to search for approx- 
imate solutions. The EM algorithm (Dempster, Laird and Rubin [5]) is the 
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usual method for obtaining a solution to the mixture likelihood problem. 
Here, as we follow a "crisp" approach where each point is uniquely assigned 
to one cluster, a classification EM approach (Celeux and Govaert [3]) is 
preferable. Moreover, as trimmed observations are allowed, the rationale be- 
hind the fast-MCD [25] and behind the trimmed fc-means algorithm [15] will 
also underly. 

The TCLUST algorithm may be described as follows: 

1. Randomly select starting values for the centers mj-'s, the covariance ma- 
trices S^'s and the weights of the groups p^'s for j = 1, . . . , k. 

2. From the 9 l = (p[, . . . ,p l k , m\, . . . , m l k , S[, . . . , S l k ) returned by the previous 
iteration: 

2.1. Obtain di = D(xi,9 l ) for the observations {x\, . . . ,x n } and keep the 
set H having the [n(l — a)] observations with largest dj's. 

2.2. Split H into H = {H±, . . . , H k } with H, = {xi£H: D 3 { Xi ,9 l ) = D(x u 
0% 

2.3. Obtain the number of data points rij in Hj and their sample mean 
and sample covariance matrix, mj and Sj, j = 1, . . . , k. 

2.4. Consider the singular-value decomposition of Sj = UjDjUj where 
Uj is an orthogonal matrix and Dj = diag(Aj) is a diagonal matrix 
(with diagonal elements given by the vector Aj). If the full vector 
of eigenvalues A = (Ai, . . . , A&) does not satisfy the eigenvalues-ratio 
restriction, obtain (for instance) through Dykstra's algorithm a new 
vector A = (Ax, . . . , A^) obeying the (ER) restriction and with ||A — 
A _1 || 2 being as smaller as possible. (A -1 denotes the vector made up 
by the inverse of the elements of the vector A.) Notice that the (ER) 
restriction for A corresponds exactly to the same (ER) restriction 
applied to A -1 . 

2.5. Update 6 l+1 by using: 

• ^ +1 ^V[n(l-a)], 

• ^ U'jDjUj and Dj = diag(A j )- 1 . 

3. Perform F iterations of the process described in step 2 (moderate values 
for F are usually enough) and compute the evaluation function L(6 F ; P n ) 
using expression (2.7). 

4. Start from step 1 several times, keeping the solutions leading to minimal 
values of L(9 F ,P n ) and fully iterate them to choose the best one. 

The computed (E-step) "a posteriori" probabilities, Dj(xi,9 l ) = pjf(xi;mj, 
Sj), are converted to a discrete classification leaving unassigned the propor- 
tion a of observations which are the hardest to classify. It is easy to see that 
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this leads us to an optimal assignment. We later obtain a new 9 + by maxi- 
mizing (M-step) the conditional expectation. Proposition 4 guarantees that 
the presented algorithm can be applied for performing this maximization. 
Notice that the obtention of the optimal scatter matrices is decomposed 
into the search of the corresponding optimal eigenvalues and eigenvectors. 
For every choice of eigenvalues, the best eigenvectors choice simply follows 
from those derived from the sample covariance matrices of the observations 
in each group. This decomposition is somehow similar to that considered in 
Gallegos' proposal, where "shapes" and "scales" were separately handled. 

Seeing D(xi,9 l ) as an inverse outlyingness measure for the observation Xi 
with respect to the choice of 9 l , then step 2 may be seen as a concentration 
step. [13] analyzes some other attempts for extending the concentration step 
principle to the heterogeneous robust clustering setup. 

Recall that the random initialization scheme (step 1) and the final refine- 
ment (step 4) will be very important as happened in the fast-MCD algorithm 
or in Maronna [21]. For initializing the procedure in step 1, we have seen 
that simply randomly choosing k sample data points for the centers, k iden- 
tity matrices for the covariances and the same weights for the groups (equal 
to 1/k) provide reasonably starting values in most of the cases. 

With respect to the eigenvalues-ratio restriction, we would need A = 
(Ai, . . . , Afc) with Aj = (Aij, . . . , X p j) belonging to the cone 

(3.1) C = {(Ai, . . . , A fe ) € W pxk : X U)V - c • A r , s < for all (it, v) + (r, a)}. 

If A ^ C, we must replace A -1 by A £ C with minimal ||A — A" 1 )) 2 . Dykstra's 
algorithm serves to approximately solve that restricted least squares problem 
as long as C is the intersection of the several closed convex cones 

C h = {(A l ,..., A fe ) € R pxk : X u , v - c • A r , s < 0} for h = (u, v, r, s), 

by resorting to iterative projections onto the individual cones C^s. (Notice 
that the projections onto the cones Ch are very fast to obtain.) Thus, a fixed 
number of individual projections may be done retaining the best attained 
solution after these iterations and satisfying the restrictions. Alternatively, 
quadratic programming based solutions (see, e.g., Goldfarb and Idnani [17]) 
for that constrained minimization may be explored. 

The next result formalizes the appropriateness of the TCLUST algorithm. 

Proposition 4. // the sets Hj = {x^ : Zj(xi) = 1}, j = 1, ■ • ■ , k, are kept 
fixed, the maximum of (2.4 ) for P = P n can be obtained through the following 
steps: 

(i) Fixed \la and Sj, the best choice of ttj is irj = rij/[n{\ — a)] where 
rij is the cardinal of set Hj . 
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(ii) Fixed Ej and the optimal values for itj given in the best choice 
for fij is the sample mean nij of the observations in Hj . 

(iii) Fixed the eigenvalues for the matrix Hj and the optimum values 
given in (i) and (ii) for Hj and fij, the best choice for the set of unitary 
eigenvectors are the unitary eigenvectors of the sample covariance matrix 
Sj of the observations in Hj . 

(iv) With the optimal selections made in (i), (ii) and (iii), the best choice 
for the eigenvalues corresponds to the inverse of the projection of the vector 
containing the inverse of the eigenvalues onto the cone C. 

Proof. Once the Zj{xi) for i = 1, . . . , n and j = 0, . . . , k are known val- 
ues, the expression (2.4) can be written as 



k 



(3-2) E 



n 



logvr i + E log/(xi;/ij,Sj) 



and the assertions (i) and (ii) trivially hold. 

Considering these optimal values for 7Tj and Uj, together with the cyclic 
property of the trace, the maximization of (3.2) simplifies to the minimiza- 
tion of 

k 

E[log|E i | + trace(ST 1 5 i )]. 
j"=i 

The matrices Sj and Ej can be decomposed into Sj = UjDjUj and Ej = 
VjEjVj, where Dj = diag(Aj) and Ej = diag(Hj) are diagonal matrices with 
Aj = (Aij, . . . , X p j) and Hj = (£i,j, ■ • ■ , £ P ,j), and Uj and Vj are orthogonal 
matrices. So, as log|Sj| = log|£j| and the eigenvalues Ej were fixed, the 
previous minimization problem can be further simplified to that of 

k k 

E trace(S- 1 ,Sj) = E tTace(Ej 1 (U j V!)' Dj(UjV<)) 

3=1 j=l 

(the cyclic property of the trace is again applied). Denote Tj = UjV- and 
rewrite 



(3.3) trace^T^T)) = E E ' & j 

7i 7) S«,j 



where i w j denotes the element (u,v) of the matrix Tj. Tj is an orthogo- 
nal matrix, so we have that X^uu,j = 1 an d Ylv^uvj = 1- Therefore, the 
minimization of (3.3) may be seen as a linear programming problem like 

^E ° u > v ' Xu > v sub J ect to E Xv » v = E Xu > v = 1 and Xu > v - °' 



mm 
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with known coefficients c U)V [notice that \ u ,j/£u,j are fixed coefficients be- 
cause X u j depends on the data set at hand and the £ u j are supposed known 
quantities in (iii)]. Although fractional solutions are possible, these solutions 
will never be basic feasible ones due to the particular statement of the linear 
programming problem (see, e.g., Papadimitriou and Steiglitz [24], page 249). 
Consequently, the optimal solution corresponds to a "real matching" where 
the optimal t\ v are or 1. Thus, Tj is a permutation matrix product of the 
orthogonal matrices Uj and VL It is quite easy to see that the columns of 
the matrices Uj and Vj must provide the same set of unitary eigenvectors 
and, thus, the assertion (iii) is proven. 

By applying (i) , (ii) and (iii) , we finally need to search for a vector E = 
(Hi, ... , E fc ) and Ej = faj,. . . , f PjJ -) minimizing 

( 3 - 4 ) EE('°g?M + r) =EE(- 1 og^M+A«, r Ay). 

3=1 t=l V ? « J 3=1 i=l 

with Aj j = l/£y • As (3.4) is a convex function on the Ajj and its unrestricted 
minimum is attained when Ajj = A"^ 1 , the minimization of (3.4) under the 
eigenvalues-ratio restriction possed by (3.1) leads us to the optimal choice 
of A with minimal ||A — A _1 || 2 and A G C. □ 

Remark 2. Alternative methods can be defined by imposing restric- 
tions on the ratio between covariance determinants instead of controlling 
eigenvalues. Gallegos [9] and [10] proposal scales the covariance matrices to 
have determinant ratio equal to 1 in the algorithm. Maronna and Jacovkis 
[22], in the untrimmed case a = 0, consider that normalization as the only 
reliable "distance" for clustering multivariate data. Here, the proposed al- 
gorithm can be easily adapted for handling restrictions of this type. In this 
case, the cone would be 

C' = {(o"i, • ■ • , (Jk) € K fc '■ cr u — c • <t v < for all u 7^ v}, 

where the factorization in step 2.4 of the previous algorithm is Sj = dj ■ Uj 
with \Uj \ = 1 and Uj = [Sjl 1 ^. If c = 1 in C , we would obtain an analogous 
to Gallegos' proposal with group weights. 

Other procedures which have been used for avoiding pathological solutions 
in the heterogeneous robust clustering problem are based on adding different 
types of parameterizations for the covariance matrices (see, e.g., Scott and 
Symons [26] or Banfield and Raftery [1]). Although that possibility has not 
been considered here, we believe that similar ideas (based on relaxing those 
parameterizations) could be interesting. 

Remark 3. The proper determination of parameters a, k and c is not 
an easy problem in general. Users of cluster analysis methods sometimes 
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have initial guesses of suitable values for these parameters, but many times 
these are completely unknown. The careful analysis of the objective func- 
tions when moving k and a provided useful information for choosing k and 
a in [15]. The objective function for the trimmed /c- means method always 
improves when increasing k (see Lemma 2.2 in [4]). Here, the possible exis- 
tence of groups with ttj = would imply that the value of the objective func- 
tion does not necessarily improve when increasing k. However, this property 
could even be more interesting in order to develop techniques for choosing 
k because a irj close to suggests that an smaller k could be needed. 

Moreover, as an anonymous referee suggested to us, we can make use 
of Bayes factors as in Van Aelst et al. [27] in order to know how well 
the observation Xj is integrated in the cluster in which it was assigned. 
If D^(xi]9) < • • • < D( k }(xi\9), the Bayes factor for a nontrimmed observa- 
tion xi is defined as BF(i) = log(D^-i){xi',9)/D^)(xi;9)). Notice that the 
smaller the Bayes factor is the better is the assignment to its corresponding 
cluster. The existence of clusters with many observation with large Bayes 
factors (close to 0) suggests that perhaps an improper choice for c was made. 
Additionally, we can introduce Bayes factors for the trimmed observations 
as BF(i) = log(D( k j(xi,8)/R(9,P n )) to measure the strength of the consid- 
eration of the trimmed data point %>i £LS £111 outlier. 



4. A simulation study. A simulation study has been carried out to com- 
pare the performance of the proposed robust clustering method with respect 
to other trimming approaches in the literature. Several data sets of size 
n = 2000 have been generated. Each data set consists of three simulated p- 
dimensional normally distributed clusters with centers ji\ = (0,8,0, . . . ,0)', 
fi2 = (8, 0, 0, ... , 0)' and ^3 = (—8, —8, 0, . . . , 0)' and covariance matrices 



and 



= diag(l,a,l,.. 




= diag(6, c, 


1,.. 


,1) 


fde 






S3 = e / 


) 


V 







The constants a, b, c, d, e and / serve to control the true differences between 
the eigenvalues of the groups' covariance matrices. This leads us to the 
consideration of the following cases: 

(Ml) (a,b,c,d,e, f) = (1,1,1,1,0,1): Spherical equally scattered groups. 
(M2) (a,b,c,d,e, f) = (5,1,5,1,0,5): Not spherical but the same covari- 
ance matrices for the groups. 
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(M3) (a, b, c, d, e, /) = (5, 5, 1, 3, — 2, 3): Different covariance matrices but 
the same scale (equal determinant). 

(M4) (a,b,c,d,e, f) = (1,20,5,15,-10,15): Groups with different scales. 

(M5) (a, b, c, d, e, /) = (1,45, 30, 15, — 10, 15): Groups with different scales 
and two of them with a severe overlap. 

We consider 1800 "regular" data points with two different group propor- 
tions. We also generate uniformly distributed data points in a parallelo- 
gram defined by the coordinatewise ranges of the regular data points. Using 
an acceptance-rejection algorithm, only points having squared Mahalanobis 
distances from /ii, fi2 and [is (using Si, S2 and S3) greater than Xp,o.975 
are finally considered until reaching an amount of 200 outliers. 

The following approaches searching for k = 3 groups and with a trimming 
proportion a = 0.1 are tried: 

(TkM) Trimmed k-means (specially aimed to the case Ml). 
(G&R) Gallegos and Ritter's method (specially aimed to the case M2). 
(G) Gallegos^ proposal (specially aimed to the case M3). 
(TCLUST) The presented algorithm with an eigenvalues-ratio restriction 
when c = 50. 

The same number of random initializations and concentration steps are 
taken for all the methods. 

Table 1 shows the average proportion of misclassified observations for 
B = 100 independent random samples of size 2000 when p = 2 and 6 and the 
group weights satisfy proportions 1:1:1 ("equal") and 1:2:2 ("unequal"). The 
numbers within parenthesis in the Table 1 show the proportion of outliers 
wrongly determined as nonoutliers and vice versa. 

Notice that all the methods work nicely under the underlying model in 
which they are specially aimed. However, the proposed eigenvalues-ratio 
restriction method is the only method which is able to cope with the mixtures 
with very different scales (mixtures M4 and M5), and it seems to be less 
affected in the unequal groups' size case. Figure 3 shows the result of these 
four analyzed procedures applied to the same data set generated by the 
simulation scheme M5 when p = 2 and unequal weights. The TCLUST seems 
to be the only one that is able to distinguish between the least and the most 
scattered groups even in this rather overlapped case. 

APPENDIX: PROOFS OF EXISTENCE AND CONSISTENCY 

A.l. Existence. The existence of solutions for our problem can be ob- 
tained through a standard argument starting by considering a sequence 
{0»}£=i = {(*?, ... ,7r£, /x?, E?, ... , Eg)}*! such that 

(A.l) lim L(9 n ,P) = sup L(9, P) = M > -00 
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Table 1 

Misclassification rates for the simulation study 



Weights Dimen. Model 



TkM 



GfcR 



TCLUST 



p — 


9 


1V11 


n 

u 


m 1 


fn ni i ^ 


n 
u 


m i 


{ u 


ni i 


n 
u 


ni 9 




n 
u 


ni 9 


(r\ m 9"i 






M2 


n 
u 


n^7 

UO 1 


(ft m7\ 


n 

u 


U1U 


\ u 




n 
u 


U1U 


(ft ni p,\ 


u 


ni is 

U1U 


(ft ni (\\ 






IVIO 


n 

u 


uou 


(ft fti.p.\ 


n 

u 


uoo 


{ U 


UOO J 


n 

U 


ni ^ 

UIO 


(ft ftTA\ 


n 

u 


ni ^ 

UIO 


(ft ni a\ 

^U.Ul^l J 






M4 





079 


(0.057) 





059 


(0 


049) 





042 


(0.031) 





020 


(0.019) 






M5 





129 


(0.046) 





131 


(0 


046) 





122 


(0.042) 





043 


(0.022) 


p = 


6 


Ml 





008 


(0.008) 





008 


(0 


008) 





008 


(0.008) 





009 


(0.009) 






M2 





035 


(0.035) 





012 


(0 


012) 





012 


(0.012) 





012 


(0.012) 






M3 





032 


(0.032) 





018 


(0 


017) 





011 


(0.011) 





011 


(0.011) 






M4 





094 


(0.072) 





038 


(0 


025) 





018 


(0.016) 





015 


(0.014) 






M5 





159 


(0.077) 





119 


(o 


026) 





055 


(0.021) 





035 


(0.015) 


p = 


2 


Ml 





011 


(0.011) 





011 


(o 


011) 





011 


(0.011) 





011 


(0.011) 






M2 





037 


(0.037) 





017 


(o 


017) 





017 


(0.017) 





016 


(0.016) 






M3 





036 


(0.036) 





034 


(0 


033) 





015 


(0.015) 





014 


(0.014) 






M4 





089 


(0.059) 





069 


(0 


051) 





047 


(0.032) 





021 


(0.019) 






M5 





151 


(0.047) 





166 


(0 


048) 





147 


(0.044) 





047 


(0.023) 


p = 


6 


Ml 





008 


(0.008) 





009 


(0 


009) 





009 


(0.009) 





009 


(0.009) 






M2 





034 


(0.034) 





011 


(0 


011) 





012 


(0.012) 





012 


(0.012) 






M3 





033 


(0.033) 





018 


(0 


018) 





012 


(0.011) 





011 


(0.010) 






M4 





107 


(0.074) 





053 


(0 


023) 





025 


(0.017) 





017 


(0.015) 






M5 





186 


(0.081) 





166 


(0 


024) 





078 


(0.022) 





039 


(0.015) 



[the boundedness from below for (A.l) can be easily obtained just consider- 
ing 7Tx = 1, fJ-i = 0, T,i = I, and setting the other weights as with arbitrary 
choices of means and variances]. 

Since [0, l] k is a compact set, we can extract a subsequence from {# n }Jf 
(that will be denoted like the original one) such that 

(A.2) 7T™ -> ttj € [0, 1] for l<j<k, 

and satisfying for some g € {0, 1, . . . , k} (a relabeling could be needed) that 
(A.3) tf^fijeR p iov0<j<g and min -> oo. 

j>9 

With respect to the scatter matrices, under (ER), we can also consider a 
further subsequence verifying one (and only one) of these possibilities: 

(A.4) E" -» Ej for l<j<k, 

(A. 5) M n = max max Xi(T,j) — > oo 

j=l,...,kl=l,...,p 

or 

(A. 6) nT-n= m i n m i n A^(S,) — >■ 0. 

j=l,...,k l=l,...,p 
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TkM G&R 




G TCLUST 




-20 -10 10 20 -20 -10 10 20 

Fig. 3. Clustering results when k — 2 and a — 0.1 for a simulated data following the M5 
scheme in the text with p = 2 and unequal weights: Trimmed k-means (TkM); Gallegos 
and Ritter (G&lK); Gallegos (G) and the presented algorithm ^TCLUSTJ with c = 50. 



Lemma A.l. If (ER) holds and if P satisfies (PR), then only the con- 
vergence (A. 4) is possible. 



Proof. We will see that (A. 5) or (A. 6) would imply lim^oo L(9 n , P) = 
oo. Let A™ • := Aj(S^) be the eigenvalues, j = 1, . . . , k and I = 1, . . . , p, of 



the group covariance matrices and 
vectors. Then we have 



1 their associated unitary eigen- 



L(6 n ,P)=E P 



■ k 

E 



*,•(-; ^logTr? -|log27r- If^logXfj 
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5E(^) _1 (- 



A«7 



1=1 



(A.7) 



<E T 



^^(•;^)(logvrJ-|log27r 
■i=i 



P -\o g m n - l -M- 1 \\--^f 



If we assume that M n — > oo holds, then m n — > oo by (ER). Thus, we would 
have that L(9 n ,P) — > —oo leading to a contradiction with (A.l). 

Now assume that (A. 6) holds. We can guarantee by Lemma A. 2 below 
that if P satisfies (PR), then there exists a constant h such that 



(Ai 



E P 



Lj=l 



-u n \\ 2 



>h>0. 



Since log7r™ < 0, the fact that P[zi(-) + • • • + Zfc(-)] = 1 — a implies 
L(0 n ,P)<{l-a) 



P P 
--log2vr - -logm 7 



Lj=l 



3 \ ) v n) 



n\\2 



Therefore, (ER) and (A. 8) give 

(A.9) L(0 n ,P)<(l-a)(-|log27r-|logm n ) -^{cm n y l h. 

But this upper-bound in (A.9) tends to — oo as m n — > 0. □ 

The following lemma has been applied in the proofs of Lemma A.l. 

Lemma A. 2. If P satisfies condition (PR), then there exists a constant 
h > such that inequality (A. 8) holds. 

Proof. The trimmed /c-means problem was introduced in [4] as the 
search of k points mi, . . . , in M p and a Borel set B minimizing: 



(A.10) 



mm 

B:P(B)>1 



min — - — - / inf llx — m,|| dP(x). 
l- a mi,...,m k P(B) Jb i<i<fe 
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Theorem 3.1 in [4] guarantees the existence of solutions for this problem. 

Thus, (A. 10) attains a minimum value that we denoted by V a ^- Now, for 
every choice of 6, we have 



Ep 



■i=i 



> E P 



£*(■;«> "J 

■3 =1 



> (1 - a)V a , k 



because \Jj =1 {x : Zj(x; 6) = 1} is a Borel set having probability greater or 
equal than 1 — a. Finally, we can trivially see that h := (1 — ct)V ay k > 
whenever condition (PR) holds for P. □ 

The next step is to show that whenever the classes in the optimal partition 
have strictly positive probability masses we can guarantee the convergence 
of the centers fjij . This result has also key importance in order to understand 
the role played by the weights irj's in this approach. 

Lemma A. 3. When (ER) and (PR) hold, if every ttj in (A. 2) verifies 
ttj > for j = 1, . . . , k, then g = k in (A. 3) . 

Proof. If g = 0, we can take a ball with center and radius big enough 
B(0,R) such that P[B(0, R)] > a. We can thus easily see that 

" k 

E P J2zj(-,0 n )\\--^\\ 2 ^oo, 
-j=i 

so that L(6 n ,P) — > — oo from (A. 7). Notice that (ER) is also here applied. 
When g > 0, we prove first that 

k 



(A.ll) 



E P 



J2 Z ^ Q r, 

■i=ff+i 



0. 



This arises from the dominated convergence theorem taking into account 
that the sequence is obviously bounded by 1 — a, and the fact that 

(A. 12) {x:zj(x;0 n ) = l}c\x: max DJx; n ) > Dx{x; B n ) \ 

I j=g+l,...,k J 

for j = g + 1, . . . ,k, where the right-hand side converges toward the empty 
set, when n tends to oo, due to (A. 3) and (A. 4). 
We can now use (A.ll) in order to get 

' 9 / v 1 

lim supL(6 n ,P)< lim E P V Zj (-; 9 n ) logvr" - y - log2^ - - log 



i=i 



(._ M »y(s^)-i(._^) 
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Ep 



Lj=l 



£ ^ (•; 9) (log vr,- - | log 2vr - - log |S 



where x \— > Zj(x;9) are the assignment functions which would be derived 
when working with g (instead of k) populations and 9 being equal to a limit 
of the subsequence {0 n }%=i = {K, • • • , rf, fl, . . . , £?, . . . , S^)}~ x . 

As X^i 71 ".?' < 1) the proof ends up by showing that we can change the 
weights 7ri,...,7r fc by 



(A.13) 



7T, 



for 1 < i < g and vr* +1 







(and properly modifying the assignment functions Zj's). This change pro- 
duces a strict decrease in the objective function, leading to a contradiction 
with the optimality stated in (2.7). Thus, we conclude g = k. □ 



Proof of Proposition 2. Taking into account previous lemmas, we 
have that one of the two possibilities must hold. 

(i) If 7r™ — > iij > for 1 < j < k, then the choice of 9 is obvious. 

(ii) If 7r™ — > 7Tj > with 7Tj > for j < g and 7Tj = for g < j < k, we 
can define weights ttj as ttj = ]im n ^ f00 iTj' for j = 1, . . . ,g and 7r g +i = • • • = 
7Tfc = 0, and, take fij = lim n ^oo fjij and Sj = lim^-xx, E" for j < 5. The other 
/ij's and Ej's may be arbitrarily chosen (of course, satisfying the eigenvalues- 
ratio restrictions). □ 

A. 2. Consistency. Given {x n }^ = i an i.i.d. random sample from an un- 
derlying (unknown) probability distribution P, let {9 n }™ = i = {(vrf, . . . , 7rj£, 
. . . , S™, . . . , S^)}^! C C denote a sequence of sample estimators ob- 
tained by solving the problem (2.4) for the empirical measures {P n }^ =1 
with the eigenvalue-ratio restrictions (ER) for a fixed c > 1 [from Proposi- 
tion 2 such a sequence does always exist, for large enough n, whenever P is 
an absolutely continuous distribution verifying (PR)]. Notice that although 
similar notation to that applied in Section A.l is here used, the index n will 
now indicate the dependence on the sample size n. 

The proof of the consistency combines arguments already used to prove 
the existence and techniques in the modern theory of empirical processes 
(see, e.g., Van der Vaart and Wellner [28]). We limit ourselves to state the 
key results as lemmas. The complete proofs can be obtained in [16]. 

First, we prove that there exists a compact set K C@ c such that 9 n G K 
for n large enough with probability 1. This follows from the next lemmas. 
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Lemma A. 4. If P is an absolutely continuous distribution [thus verifying 
(^PR)/ then the minimum (resp. maximum) eigenvalue, m n (resp. M n ) of 
the matrices X™, j = l,...,k, cannot verify m n — > (resp. M n — > oo). 

Lemma A. 5. If P is an absolutely continuous distribution, then we can 
choose empirical centers [i™ , j = l,...,k, such that their norms are uni- 
formly bounded with probability 1. 

The following lemmas, related to uniform convergences, complete our 
technical needs for the final proof. The second involves R(9;P) in (2.6). 



Lemma A. 6. Given a compact set K, the class of functions 

(A.i4) n -.= | i K oo)(£>(-; o)) J2 *; (■; o) iog^(- ; e) ■. e e k, u > o| 

is a Glivenko-Cantelli class, with Zj(x;0) = I{x:D(x;9) = Dj(x;9)}. (All 
the points in M. p are assigned to some class through the z^'s.) 

Lemma A. 7. Let P be an absolutely continuous distribution with an 
strictly positive density function. Then for every compact subset K , we have 
that 



(A.15) 



sup \R(9;P n )-R(0;P) | -*•(), P-a.e. 
9eK 



We can now prove the stated consistency result. 



Proof of Proposition 3. Let K be a compact set such that 9 n <E K 
for n > no with probability 1. The objective function in the empirical case 
can be rewritten as: 

fe 



L{9, P n ) 



{x:D(x,e)>R(8;P„)} 



■i=i 



dP n (x), 



with the Zj's introduced in Lemma A. 6. Let us introduce 



L(9,P n )-- 
We can see that 



{x:D(x,e)>R{9;P)} 



Y / z*{x;9)logD j {x;( 



dPJx). 



S np\L(9;P n )-L(9-P n )\=o P (l), 
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by using Lemma A. 7 and the fact that the integrand can be bounded from 
above and below from some constants uniformly for 9 in the compact set K. 

Finally, we resort to the Glivenko-Cantelli property for the class of func- 
tions H in (A. 14), and apply Theorem 3.2.3 in [28] to achieve the result. 
□ 
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